Evidence exists that reading achievement can be
measured simply and validly by having students read aloud for one
minute from vocabulary lists drawn from their basal reading series.
Direct and frequent measurement of student performance using this
procedure provides a means for continuously evaluating a student's
instructional program. The present study investigated the effects of
varying the size of the population of words from which test items for
daily testing were sampled. Results indicated that grade-level lists
were more sensitive to changes in performance and that across-grade
lists produced less variability in performance. The size of the word
population did not seem to influence the ability of judges to perform
visual analyses of instructional effects. The implications of the
findings for measurement and teaching are discussed. (Author)



Daily Measurement of Reading:
Effects of Varying the Size of the Item Pool

Two activities inherent in instruction are observation of student
performance and adjustment of instructional tactics based on those
observations. Typically, of course, teachers' observations are informal
and tactical adjustments are unsystematically introduced. As the
requirements for accountability increase, however, and as instructional
designers attempt to improve instructional systems through educational
technology, greater emphasis is placed on tests as a basis for observing
student performance and evaluating program effectiveness. One effect
of increasingly using tests in this way is the misuse of commercially
prepared standardized tests (Salvia & Ysseldyke, 1981). Tests designed
for psychometric purposes are used as edumetric instruments (Carver,
1974) and poor fits between "what is taught" and "what is tested" occur
(Jenkins & Pany, 1978; Skager, 1971).

One alternative to commercially prepared achievement tests is direct
observation and recording of student performance within the curriculum
(Lovitt, Schaff, & Sayre, 1970; White & Haring, 1980). Assessment of
performance within the curriculum in which the student is receiving
instruction is an attractive alternative since it reduces the gap
between what is taught and what is tested. Further, the use of informal
classroom measures may make it possible to tailor measurement to the
individual student and the educational program, to measure student
performance on a frequent basis, and to monitor and evaluate the
effectiveness of instructional programs.

Although the use of informal measures appears helpful in monitoring
students' performance and in evaluating instructional programs,
a variety of technical questions related to curriculum-based
measurement need to be investigated. A first, critical question asks from
what curriculum material it is appropriate to create the measurement
task. Within any given curriculum sequence, a decision must be made
regarding the level of difficulty from which the stimulus materials
will be sampled. In reading, for example, words and passages may be
selected from instructional level, independent level, or frustration
level (Mirkin & Deno, 1979). Intuitively, it would seem that material
used for continuous performance assessment should be neither too
difficult (frustration level) nor too easy (independent level). If too
difficult, performance will be low and the rate of increase too slow, thus
precluding the use of data for evaluating the effects of instruction.
Conversely, with the use of very easy material, very fast growth may
occur in which student performance reaches a ceiling. In that event,
changes must be made in the test stimulus for additional growth to be
shown. If so, it will be difficult to determine the effect of a change
in instructional strategy since it will be confounded with change in the
test stimuli.

Somewhere between the two difficulty extremes, stimulus material
must be identified that may be used over a relatively long time period
to reliably and sensitively monitor student progress and reflect the
effects of changes in the instructional program. When repeatedly
measuring student performance over time in this manner, the measurement
items will have to be kept constant, since changes in the testing
procedures (e.g., items) would be confounded with changes in the
instructional program (Campbell & Stanley, 1963). For this reason, direct
measurement of student performance on the daily instructional task, as
often recommended (Lovitt, 1967), may not be a workable solution to the
question of what to measure in the curriculum.

The purpose of this study was to investigate how item selection
in curriculum-based reading measurement impacts several technical
characteristics of the measurement system. The measurement procedures
investigated were based on research by Deno, Mirkin, and Chiang (1981)
that established the validity of reading aloud from basal text vocabulary
for measuring reading achievement. A first concern was how the
population of words from which items were sampled for daily measurement
influences the level and variability of student performance. A second
question was how the size of the population of words influences the
sensitivity of the daily measurement system for evaluating instructional
programs. To investigate these issues, three measures were developed,
differing only with respect to the size of the population of vocabulary
words from which test items for daily testing were sampled.

Method 

Subjects 

Five special education resource teachers in the Minneapolis Public
Schools, who had volunteered to participate in the study, were asked
to list students who were reading at the second, third, or fourth grade
instructional level. Four students were randomly selected from each
teacher's list; these 20 students served as subjects in the study.
Materials 

To develop daily measures of the student's reading performance, the
following procedures were used.

First, for each student three populations of reading vocabulary
words were created using the Harris-Jacobson Word List (1972). The
first and largest population, called the Across-Grade list (AG), consisted
of the entire pool of Harris-Jacobson words from the Preprimer-Grade 1
through Grade 4. A second population, termed the Grade-Level list (GL),
consisted of all the Harris-Jacobson words from within the student's
grade level. The third, the Instructional-Level list (IL), was a subset of
200 words drawn at random from the GL population. The three populations
differed, then, in terms of the scope of reading vocabulary words
included. The scope of the AG population was the largest and the scope
of the IL population was the smallest.

Daily word lists for testing were then created by drawing 60 words 
at random from each of the three populations. A different random sample 
from the respective domains was drawn each day to compose the daily 
test. Twenty word lists for each domain were created by random sampling 
with replacement. Therefore, the amount of repetition (words appearing 
more than once) from day to day within each list increased considerably 
from the Across-Grade list to the Instructional-Level list. Each 
teacher was given a set of 20 of each type of word list for every student. 
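
The sampling scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the AG and GL pool sizes (2,000 and 500) and the placeholder word names are hypothetical, the function names are invented for this sketch, and "sampling with replacement" is read here as returning each day's 60 words to the pool before the next day's draw. Only the 60-word daily lists, the 20 lists per population, and the 200-word IL set come from the text.

```python
import random


def make_daily_lists(population, n_days=20, list_len=60, seed=0):
    """Draw one random list of `list_len` distinct words per day.

    Each day's words return to the pool before the next draw, so a word
    can reappear on later days (an assumed reading of the procedure).
    """
    rng = random.Random(seed)
    return [rng.sample(population, list_len) for _ in range(n_days)]


def mean_day_to_day_overlap(daily_lists):
    """Average number of words shared by consecutive daily lists."""
    shared = [len(set(a) & set(b)) for a, b in zip(daily_lists, daily_lists[1:])]
    return sum(shared) / len(shared)


# Hypothetical pool sizes; only the 200-word IL set comes from the text.
ag_pool = [f"ag_word_{i}" for i in range(2000)]
gl_pool = [f"gl_word_{i}" for i in range(500)]
il_pool = random.Random(1).sample(gl_pool, 200)

for name, pool in [("AG", ag_pool), ("GL", gl_pool), ("IL", il_pool)]:
    lists = make_daily_lists(pool)
    print(name, round(mean_day_to_day_overlap(lists), 1))
```

With pools of size N, two consecutive 60-word lists are expected to share about 60 x 60 / N words, which is why repetition rises as the population shrinks from AG to IL.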
Procedures 

To determine an appropriate Grade-Level list in which to place the
child, the student read aloud from the Grade-Level word lists for
grades 1, 2, 3, and 4 for 30 seconds each. This procedure was repeated
for five days. During this period, the teachers gave the Grade-Level
word list reading tests without specific instruction on the words.
The number of words read correctly and incorrectly on each of the
four Grade-Level word lists was recorded daily, and the student was
placed for instruction in the Grade-Level population where the median
number of words read correctly over the five days was the highest.
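
The placement rule can be illustrated with a short Python sketch. The scores below are hypothetical, the function name `place_student` is invented for this example, and the handling of ties (the first grade encountered wins) is an assumption, since the text does not say how ties were resolved.

```python
from statistics import median


def place_student(correct_by_grade):
    """Return the grade whose five-day median of words read correctly is highest."""
    medians = {grade: median(scores) for grade, scores in correct_by_grade.items()}
    best_grade = max(medians, key=medians.get)  # ties: first grade listed wins
    return best_grade, medians


# Hypothetical words-correct counts from five daily 30-second readings.
baseline = {
    1: [14, 15, 13, 16, 15],
    2: [18, 20, 19, 21, 17],
    3: [12, 11, 14, 13, 12],
    4: [6, 8, 7, 5, 9],
}

grade, medians = place_student(baseline)
print(grade, medians)
```

Using the median rather than the mean keeps a single unusually good or bad day from determining the placement.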

Beginning the second week, the teachers initiated instruction
for all their students. Each teacher was given the 200 Instructional-Level
words that were drawn from the Grade-Level population in which
the child had been placed. Each student was instructed individually
for 10 minutes daily on the 200-word Instructional-Level set.

Immediately following the instructional period the student took
a 30-second word reading test on each of the three populations of words
using the daily test lists that had been created. The number of words
read correctly and incorrectly on each type of word list was recorded
by the teacher, and three daily performance graphs were created
displaying correct and incorrect word reading.

Throughout the course of the study, the performance graphs were
evaluated to determine the amount of improvement in the student's
reading performance. Decisions were made weekly regarding whether to change
a student's program. Attempts were made to incorporate procedures
specific to that student's graphed performance (e.g., if a student's error
rate was high, a change might be made to include error correction or
a response cost procedure to reduce errors). In the event that five
days of data were insufficient to reveal clear performance trends, the
previous interventions were continued for two days and the judgment
decision process resumed after the seventh day. A maximum of 15 days
was allowed for keeping the same instructional format. When a decision
to change was made, the instructional intervention was implemented for
another five days, after which the above procedure was applied again.

Results 

Two primary analyses were conducted to assess the influence of the
different word populations on the measurement data. The first analysis 
addressed the effects on the sensitivity of each test procedure to growth, 
and to variability in performance. The second analysis was conducted 
to assess the effects of the different item populations on evaluating 
changes in the instructional program. 
Differences in Measurement Characteristics 

Analysis of student performance on each type of daily test, from
pretest to posttest, indicated differential sensitivity as a function of the
population from which the daily test was created (see Table 1). When the
populations were compared with respect to the mean difference in number
of words read correct from pretest to posttest (i.e., the mean of the
last three days), a reliable difference was found between populations,
with the difference greatest for the Instructional-Level lists, followed
by the Grade-Level lists, and the least gain occurring with the
Across-Grade lists. When accuracy was analyzed, a greater gain in percent of
words read correct was obtained on the Grade-Level lists than on the
Across-Grade lists. In this analysis, however, no reliable difference
in gain was obtained between the Instructional-Level and Grade-Level
lists.
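
The two gain scores can be illustrated with a small Python sketch. The text defines the posttest as the mean of the last three days but does not define the pretest; taking the pretest as the mean of the first three days is an assumption made here, as are the function names and the daily counts.

```python
def percent_correct(correct, incorrect):
    """Percent of attempted words that were read correctly."""
    attempted = correct + incorrect
    return 100.0 * correct / attempted if attempted else 0.0


def gain(daily_scores, n_days=3):
    """Posttest mean (last n_days) minus pretest mean (first n_days, assumed)."""
    pretest = sum(daily_scores[:n_days]) / n_days
    posttest = sum(daily_scores[-n_days:]) / n_days
    return posttest - pretest


# Hypothetical daily counts for one student on one type of list.
correct = [12, 13, 11, 14, 15, 16, 17, 18, 17, 19]
incorrect = [6, 5, 7, 5, 4, 4, 3, 3, 3, 2]

accuracy = [percent_correct(c, i) for c, i in zip(correct, incorrect)]
print("gain in words correct:", gain(correct))
print("gain in percent correct:", round(gain(accuracy), 2))
```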

Insert Table 1 about here 




In Table 2, the semi-interquartile range (i.e., one-half the
difference between the 75th and 25th percentile scores) is presented
for the word list tests drawn from each population. These
semi-interquartile ranges are presented for the fourth and twelfth instructional
days. As can be seen, the obtained semi-interquartile ranges for
all list scores remained quite consistent from Day 4 to Day 12.
Differences in variability between lists also were quite consistent
and small, with the GL lists the smallest on both Days 4 and 12.
Variability also can be contrasted by examining the standard deviations
presented in Table 1; the variability of the scores was consistently
smaller on the tests created from the Across-Grade population of items.
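
The semi-interquartile range defined above can be computed as in the following Python sketch. The scores are hypothetical, and the percentile convention (the "inclusive" method of the standard library's `statistics.quantiles`) is an assumption; the text does not say how the 25th and 75th percentiles were obtained, and other conventions give slightly different values on small samples.

```python
from statistics import quantiles


def semi_interquartile_range(scores):
    """Half the distance between the 75th and 25th percentile scores."""
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    return (q3 - q1) / 2


# Hypothetical words-correct scores for one type of list on one day.
day_scores = [10, 12, 13, 15, 16, 18, 19, 21, 24]
print(semi_interquartile_range(day_scores))
```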

Insert Table 2 about here 
Differences in Evaluation of Instruction

A second analysis addressed the question of how well the measures
created from each population could be used to evaluate changes in the
instructional programs. To do so, the graphs of all students (3 per
student) were randomly placed in folders (60 graphs). The graphs were
presented independently to four judges; folders allowed the judges to
see only the student's actual performance, with no information regarding
type of word list, or even scaling of the axes.

Each judge was instructed to examine the student's performance in
relation to the introduction of new instructional strategies, and decide
whether the intervention had an effect upon the student's performance.
Judges were told to attend both to the number of words read correct and
to the number read incorrect. An effect on performance was defined as
occurring when the number of words read correct increased and errors
stayed the same, or when correct stayed the same but errors decreased.
Effects also were to include instances involving an increase in errors
with no change in correct, or a decrease in number of words read correct.
Judges were instructed to attend to variability of performance, along
with increases/decreases in either corrects or errors. One further
definitional aspect involved the magnitude of change, that is, how much
of an increase or decrease was needed for an effect to be judged. An
arbitrary value of 2 to 3 words was used as the magnitude sufficient to
consider an effect.

When the above definitional standards for judging whether an
intervention has an effect were used, a coefficient of concordance (agreements
divided by agreements + disagreements) of .67 was attained for the first
two judges; for judges three and four, a coefficient of .63 was obtained.
While these coefficients are relatively low, they are consistent with other
published reports of this type of analysis.
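
The agreement index in parentheses above is simple enough to show directly. The sketch below is in Python; the rating labels and the example ratings are hypothetical, and the function name is invented for this illustration.

```python
def agreement_coefficient(ratings_a, ratings_b):
    """Agreements divided by (agreements + disagreements) for two judges."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both judges must rate the same, non-empty set of graphs")
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    disagreements = len(ratings_a) - agreements
    return agreements / (agreements + disagreements)


# Hypothetical ratings of six interventions by two judges.
judge_1 = ["effect", "no effect", "effect", "not enough data", "effect", "no effect"]
judge_2 = ["effect", "effect", "effect", "not enough data", "no effect", "no effect"]

print(round(agreement_coefficient(judge_1, judge_2), 2))
```

This index does not correct for agreement expected by chance, so it will generally run higher than a chance-corrected statistic computed on the same ratings.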

In Table 3, the percents of interventions deemed to have an effect
are presented for each word population and each judge. Table 4 presents
the combined results for judges 1 and 2 and the combined results of
judges 3 and 4. List population size was related reliably to the number
of treatment changes judged effective by judges 1 and 2, both separately
and combined. Chi square analysis revealed that judges 1 and 2
identified the lowest percent of apparent effects on AG scores (χ² = 6.2 for
judges 1 and 2; χ² = 6.02 for judge 1; χ² = 5.96 for judge 2). When
the same analysis was conducted using judges 3 and 4, this finding was
not replicated. For judges 3 and 4, no reliable effects for different
list populations were obtained.
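
For readers unfamiliar with the statistic, a Pearson chi-square on a table of counts can be computed as in the Python sketch below. The counts are hypothetical (the report gives percentages, not frequencies), and the table layout, three list populations by two judgment outcomes, is an assumption; the text does not state how the authors arranged their table or how many degrees of freedom were involved.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are IL, GL, AG; columns are
# "effect apparent" and "effect not apparent or not enough data".
counts = [
    [19, 26],
    [23, 22],
    [13, 32],
]

statistic, dof = chi_square(counts)
print(f"chi-square = {statistic:.2f}, df = {dof}")
```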

Insert Tables 3 and 4 about here



Discussion 

The issue of item population is an important consideration in the
development of a curriculum-based evaluation system. Not only might
student performance vary as a function of the population from which
samples are drawn, but the utility of using this type of data to evaluate
instructional programs also may be influenced by the population size.
It appears from this study that the best measurement system will be
comprised of items which sample from the grade level at which the student
is functioning although not necessarily the material in which instruction
is being given. When a measure of student performance included words
from the grade level in which the student was placed, growth was more
rapid than when the words were drawn from a broader population of words.
Further, weak evidence was obtained that the measurement systems based
on the grade-level populations would produce performance graphs that
would contribute more clearly to a visual analysis of instructional
effects. The lack of consistency in judges' use of the graphs of student
performance to evaluate instruction is perplexing, however. One can
only assume that the judges were inadequately trained, that the
instructional interventions were not sufficiently powerful, or that the
measures were not adequately sensitive. At present, no basis exists
for selecting one explanation rather than another.

One important finding of the study was that a daily measurement
system may be developed for reading instruction that can be used over an
extended period of time without having to be revised or changed. To
create such a system requires specifying a population broad enough
that a ceiling is not reached and narrow enough that the measure is
sensitive to performance change. The implication is that measurement
and instruction can proceed in a complementary fashion without
undue domination by either. Teachers need not teach to the test nor
limit their testing to instructional units.

Table 1

Mean Gain in Performance from Pre- to Posttest (a)

                                          List
Measure                  IL               GL               AG

Number of Words (b)       5.38 (3.50)      3.73 (3.59)      1.25 (2.98)
Percent of Words (c)     15.20 (13.53)    19.68 (17.47)     9.05 (12.19)

(a) Entries are the means and standard deviations (in parentheses) of the
differences between pretest and posttest.

(b) Mean gain for the AG list was significantly different (p < .05) from
that on either the IL or GL lists.

(c) Mean gain for the GL list was significantly different (p < .05) from
that on the AG list.

Table 2

Semi-Interquartile Ranges of Performance on Each Test List

Word List               Day 4     Day 12

Instructional-Level     4.10      4.05
Grade-Level             3.33      2.99
Across-Grade            4.68      5.18

Table 3

Percentages of Interventions Judged to have Apparent Effects, Not Apparent Effects,
or Not Enough Data as a Function of Judges and List Population

List          Effects Apparent                    Effects Not Apparent                Not Enough Data
Population    Jdg 1(a) Jdg 2(a) Jdg 3    Jdg 4    Jdg 1    Jdg 2    Jdg 3    Jdg 4    Jdg 1    Jdg 2    Jdg 3    Jdg 4

IL            42       51       65       49       20       16       24       29       38       33       11       22
GL            51       61       53       53       40       32       38       26       9        8        9        21
AG            29       36       62       51       37       29       28       28       34       35       10       21

(a) Difference between three list populations was significant (p < .05).

Table 4

Percentages of Interventions Judged to have Apparent Effects,
Not Apparent Effects, and Not Enough Data by Judges 1 and 2 Combined and
Judges 3 and 4 Combined (a)

List          Effects Apparent          Effects Not Apparent      Not Enough Data
Population    Jdgs 1&2(b)  Jdgs 3&4     Jdgs 1&2     Jdgs 3&4     Jdgs 1&2     Jdgs 3&4

IL            46           58           18           27           36           15
GL            56           53           36           --           8            17
AG            32           57           33           28           35           15

(a) Interrater reliability was .67 for Judges 1 and 2 and .63 for Judges 3 and 4.
(b) Difference between three list populations was significant (p < .05).